24 research outputs found

    Energy‐aware strategies for task‐parallel sparse linear system solvers

    This is the pre-peer reviewed version of the following article: Energy‐aware strategies for task‐parallel sparse linear system solvers, which has been published in final form at https://doi.org/10.1002/cpe.4633. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions. We present several energy‐aware strategies to improve the energy efficiency of a task‐parallel preconditioned Conjugate Gradient (PCG) iterative solver on a Haswell‐EP Intel Xeon. These techniques leverage the power‐saving states of the processor, promoting the hardware into a more energy‐efficient C‐state and modifying the CPU frequency (P‐states of the processor) during some operations of the PCG. We demonstrate that the application of these strategies during the main operations of the iterative solver can reduce its energy consumption considerably, especially for memory‐bound computations.
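    The frequency-scaling (P-state) side of these strategies can be pictured with a small, heavily hedged sketch. The C program below is not the authors' code: it assumes a Linux machine with the "userspace" cpufreq governor active and write permission on the sysfs files, lowers the frequency of one core around a bandwidth-bound loop that stands in for a memory-bound PCG kernel, and then restores it. The 1.2/2.3 GHz values and the dummy kernel are placeholders.

```c
/* Hedged sketch, not the paper's code: lower the P-state of one core around
 * a memory-bound loop via the standard Linux cpufreq sysfs interface.
 * Valid only while the "userspace" governor is selected and the process may
 * write the sysfs files; frequencies are in kHz. */
#include <stdio.h>
#include <stdlib.h>

static int set_cpu_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%ld\n", khz);
    return fclose(f);
}

/* Stand-in for a memory-bound operation such as the sparse matrix-vector
 * product: a simple streaming update over a large array. */
static void memory_bound_kernel(double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = 0.5 * x[i] + 1.0;
}

int main(void)
{
    size_t n = 1u << 26;                 /* ~512 MiB of doubles          */
    double *x = malloc(n * sizeof *x);
    if (!x) return 1;

    set_cpu_khz(0, 1200000);             /* drop core 0 to a low P-state */
    memory_bound_kernel(x, n);           /* bandwidth-bound work         */
    set_cpu_khz(0, 2300000);             /* restore the nominal frequency*/

    free(x);
    return 0;
}
```

    On a memory-bound kernel the lower P-state costs little extra time but trims power draw, which is the effect the strategies above exploit.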

    Paralelización del cálculo del número pi utilizando librerías de precisión múltiple en un clúster de computadores

    Paper presented at the XXXII Jornadas de Paralelismo (JP2022) and the VI Jornadas de Computación Empotrada y Reconfigurable (JCER2022), SARTECO 2022. High Performance Computing (HPC) is a field of computer engineering whose main goal is to extract the best possible performance from the available resources of a computer in order to solve a computational problem. Achieving this goal requires detailed knowledge of the architecture and the operating system, as well as of the algorithms and programming languages that make it possible to implement more efficient codes. This article presents an evaluation of two multiple-precision CPU libraries and shows how they can be used on multicore processors and on multicomputers, or clusters of multicore processors, to compute the numerical constant π.
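    The abstract does not name the two multiple-precision libraries that were evaluated; assuming GNU MPFR as one representative choice, a minimal single-core baseline for obtaining π to a fixed precision could look like the sketch below. The multicore and cluster-level parallelization studied in the paper is not reproduced here.

```c
/* Minimal single-core sketch with GNU MPFR (one plausible multiple-precision
 * library; the paper's library choices and its parallel scheme are not shown).
 * Build with: gcc pi.c -lmpfr -lgmp */
#include <stdio.h>
#include <mpfr.h>

int main(void)
{
    const mpfr_prec_t bits = 3324;   /* about 1000 decimal digits          */
    mpfr_t pi;

    mpfr_init2(pi, bits);            /* allocate with the requested precision */
    mpfr_const_pi(pi, MPFR_RNDN);    /* built-in constant, round to nearest   */
    mpfr_printf("pi = %.1000Rf\n", pi);

    mpfr_clear(pi);
    mpfr_free_cache();               /* release MPFR's cached constants       */
    return 0;
}
```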

    Characterization of Multicore Architectures using Task-Parallel ILU-type Preconditioned CG Solvers

    Paper presented at the 2nd Workshop on Power-Aware Computing (PACO 2017), Ringberg Castle, Germany, July 5-8, 2017. We investigate the efficiency of state-of-the-art multicore processors using a multi-threaded task-parallel implementation of the Conjugate Gradient (CG) method, accelerated with an incomplete LU (ILU) preconditioner. Concretely, we analyze multicore architectures with distinct designs and market targets to compare their parallel performance and energy efficiency.
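    For context, the sequential skeleton of the preconditioned CG iteration that such implementations decompose into tasks is sketched below in plain C. The callbacks for the sparse matrix-vector product and the ILU-type preconditioner, as well as the task decomposition itself, are left abstract and are not taken from the paper.

```c
/* Textbook preconditioned CG skeleton for illustration only; spmv and prec
 * are user-supplied callbacks (sparse matrix-vector product and M^{-1} apply). */
#include <math.h>
#include <stdlib.h>
#include <string.h>

typedef void (*op_t)(const double *in, double *out, int n, void *ctx);

static double dot(const double *x, const double *y, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* Solve A x = b; x holds the initial guess on entry and the solution on exit.
 * Returns the number of iterations performed. */
int pcg(op_t spmv, void *A, op_t prec, void *M,
        const double *b, double *x, int n, double tol, int maxit)
{
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);

    spmv(x, q, n, A);                            /* r = b - A*x           */
    for (int i = 0; i < n; i++) r[i] = b[i] - q[i];
    prec(r, z, n, M);                            /* z = M^{-1} r          */
    memcpy(p, z, n * sizeof *p);
    double rho = dot(r, z, n);

    int it;
    for (it = 0; it < maxit && sqrt(dot(r, r, n)) > tol; it++) {
        spmv(p, q, n, A);                        /* q = A*p               */
        double alpha = rho / dot(p, q, n);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        prec(r, z, n, M);                        /* z = M^{-1} r          */
        double rho_new = dot(r, z, n);
        double beta = rho_new / rho;
        for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
        rho = rho_new;
    }
    free(r); free(z); free(p); free(q);
    return it;
}
```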

    Harvesting Energy in ILUPACK via Slack Elimination

    Paper presented at the 2nd Workshop on Power-Aware Computing (PACO 2017), Ringberg Castle, Germany, July 5-8, 2017. We develop a new energy-aware methodology to improve the energy consumption of a task-parallel preconditioned Conjugate Gradient iterative solver on a Haswell-EP Intel Xeon. This technique leverages the power-saving modes of the processor and the frequency range of the userspace Linux governor, modifying the CPU frequency for some operations. We demonstrate that its application during the main operations of the PCG solver can reduce its energy consumption.
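    As a rough illustration of the mechanism named in the abstract (and only that; the paper's instrumentation is not shown), the sketch below selects the Linux "userspace" governor for one core and narrows its frequency range through the standard cpufreq sysfs files. The core number and the 1.2-2.3 GHz bounds are placeholders, and writing these files normally requires administrative privileges.

```c
/* Hedged sketch, not the paper's tooling: select the "userspace" governor
 * for a core and narrow its frequency range so that subsequent frequency
 * requests stay within [min, max].  Values are in kHz. */
#include <stdio.h>

static int write_cpufreq(int cpu, const char *file, const char *value)
{
    char path[160];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/%s", cpu, file);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void)
{
    const int cpu = 0;
    write_cpufreq(cpu, "scaling_governor", "userspace");  /* manual control  */
    write_cpufreq(cpu, "scaling_min_freq", "1200000");    /* 1.2 GHz floor   */
    write_cpufreq(cpu, "scaling_max_freq", "2300000");    /* 2.3 GHz ceiling */
    return 0;
}
```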

    iMODS: internal coordinates normal mode analysis server

    Normal mode analysis (NMA) in internal (dihedral) coordinates naturally reproduces the collective functional motions of biological macromolecules. iMODS facilitates the exploration of such modes and generates feasible transition pathways between two homologous structures, even with large macromolecules. The distinctive internal coordinate formulation improves the efficiency of NMA and extends its applicability while implicitly maintaining stereochemistry. Vibrational analysis, motion animations and morphing trajectories can be easily carried out at different resolution scales almost interactively. The server is versatile; non-specialists can rapidly characterize potential conformational changes, whereas advanced users can customize the model resolution with multiple coarse-grained atomic representations and elastic network potentials. iMODS supports advanced visualization capabilities for illustrating collective motions, including an improved affine-model-based arrow representation of domain dynamics. The generated all-heavy-atoms conformations can be used to introduce flexibility for more advanced modeling or sampling strategies. This work was supported by the Human Frontier Science Program (RGP0039/2008), Ministerio de Economía y Competitividad (BFU2013-44306P) and Comunidad de Madrid (CAM-S2010/BMD).
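    As background, and assuming only the standard internal-coordinate NMA formulation (the server's specific elastic-network potential and coarse-grained representations are described in the paper, not here), the normal modes are obtained from a generalized eigenvalue problem of the form shown below.

```latex
% Generic internal-coordinate NMA eigenproblem (standard form; the specific
% Hessian H and kinetic-energy matrix T used by iMODS are defined in the
% paper, not here).
\begin{equation}
  \mathbf{H}\,\mathbf{v}_k = \omega_k^{2}\,\mathbf{T}\,\mathbf{v}_k ,
\end{equation}
% H: Hessian of the potential energy in dihedral coordinates,
% T: kinetic-energy (mass) matrix in the same coordinates,
% v_k: normal mode k, omega_k: its angular frequency.
```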

    Iteration-fusing conjugate gradient for sparse linear systems with MPI + OmpSs

    In this paper, we target the parallel solution of sparse linear systems via an iterative Krylov subspace-based method enhanced with a block-Jacobi preconditioner on a cluster of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the preconditioned conjugate gradient method that improve the interoperability between the message-passing interface and OmpSs programming models. Specifically, we progressively integrate several communication-reduction and iteration-fusing strategies into the initial code, obtaining more efficient versions of the method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 32 nodes with 24 cores each. The experimental analysis shows that the techniques described in the paper outperform the classical method by a margin that varies between 6% and 48%, depending on the evaluation.
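    One communication-reduction tactic in the spirit of this work (a generic sketch, not the paper's exact scheme) is to start the global reduction of a dot product with a non-blocking collective and overlap it with purely local work such as applying the block-Jacobi preconditioner. The helper apply_local_preconditioner below is a hypothetical placeholder.

```c
/* Hedged sketch: hide (part of) the latency of a CG reduction behind local
 * work by using MPI_Iallreduce.  Not the paper's code. */
#include <mpi.h>

/* Placeholder: block-Jacobi is purely local, so it needs no communication. */
static void apply_local_preconditioner(const double *r, double *z, int nlocal)
{
    for (int i = 0; i < nlocal; i++)
        z[i] = r[i];                      /* identity stand-in              */
}

/* Returns the global r.r while overlapping the reduction with the
 * preconditioner application z = M^{-1} r on the local block. */
double overlapped_residual_norm2(const double *r, double *z, int nlocal,
                                 MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; i++)
        local += r[i] * r[i];             /* local partial dot product      */

    MPI_Request req;
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    apply_local_preconditioner(r, z, nlocal);   /* useful independent work  */

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* reduction finished; use global */
    return global;
}
```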

    Dynamic spawning of MPI processes applied to malleability

    Malleability allows computing facilities to adapt their workloads through the resource management system in order to maximize the throughput of the facility and the efficiency of the executed jobs. The technique is based on reconfiguring a job to a different amount of resources during execution and then continuing with it. One of the stages of malleability is the dynamic spawning of processes at execution time; the decisions taken in this stage affect how the next stage, data redistribution, is performed, which is the most time-consuming one. This paper describes different methods and strategies, defining eight alternatives for spawning processes dynamically, and indicates which one should be used depending on whether the application follows strong or weak scaling. In addition, it describes, for both types of applications, which strategies benefit the application performance or the system productivity the most. The results show that reducing the number of newly spawned processes by reusing the existing ones can reduce the reconfiguration time, compared to the classical method, by up to 2.6 times when expanding and up to 36 times when shrinking. Furthermore, the asynchronous strategy requires analysing the impact of oversubscription on application performance. This work has been funded by the following projects: project PID2020-113656RB-C21, supported by MCIN/AEI/10.13039/501100011033, and project UJI-B2019-36, supported by Universitat Jaume I. Researcher S. Iserte was supported by the postdoctoral fellowship APOSTD/2020/026, and researcher I. Martín-Álvarez was supported by the predoctoral fellowship ACIF/2021/260, both from the Valencian Region Government and the European Social Fund.
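    The MPI primitive underlying all of the spawning alternatives compared in the paper is MPI_Comm_spawn. The minimal sketch below shows only that primitive, with the binary name "./malleable_app" and the count of 4 extra processes as placeholders, and omits the data-redistribution stage entirely.

```c
/* Hedged sketch of the basic MPI mechanism behind dynamic spawning: the
 * original job launches extra copies of its own binary and obtains an
 * intercommunicator to them.  Not one of the paper's eight alternatives. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Original job: spawn 4 additional processes running this binary. */
        MPI_Comm intercomm;
        int errcodes[4];
        MPI_Comm_spawn("./malleable_app", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, errcodes);
        /* Data redistribution to the enlarged process set would follow here. */
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Newly spawned process: would receive its share of the data here. */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
```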

    Communication in task-parallel ILU-preconditioned CG solvers using MPI + OmpSs

    We target the parallel solution of sparse linear systems via iterative Krylov subspace-based methods enhanced with incomplete LU (ILU)-type preconditioners on clusters of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the classical iteration for the CG method, accelerated via ILUPACK and ILU(0) preconditioners, using MPI + OmpSs. In addition, we integrate several communication-avoiding strategies into the codes, including the butterfly communication scheme and Eijkhout's formulation of the CG method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 16 nodes, with 16 cores each.
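    A simple example of the communication-reduction theme (a generic tactic, not necessarily one of the paper's specific schemes) is to merge independent global reductions of the same CG iteration, such as the dot product r·z and the squared residual norm r·r, into a single MPI_Allreduce over a packed buffer, saving one synchronization per iteration:

```c
/* Hedged sketch: fuse two independent CG reductions (rho = r.z and the
 * squared residual norm r.r) into one MPI_Allreduce over a packed buffer.
 * A generic tactic in the spirit of the paper, not its exact scheme. */
#include <mpi.h>

void fused_reductions(const double *r, const double *z, int nlocal,
                      MPI_Comm comm, double *rho, double *rnorm2)
{
    double local[2] = {0.0, 0.0}, global[2];

    for (int i = 0; i < nlocal; i++) {
        local[0] += r[i] * z[i];     /* partial r.z  */
        local[1] += r[i] * r[i];     /* partial r.r  */
    }

    /* One collective instead of two. */
    MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, comm);

    *rho    = global[0];
    *rnorm2 = global[1];
}
```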

    Characterizing the efficiency of multicore and manycore processors for the solution of sparse linear systems

    We analyze the efficiency of servers equipped with state-of-the-art general-purpose multicore processors as well as platforms based on accelerators such as graphics processing units (GPUs) and the Intel Xeon Phi. Following the proposal recently advocated in the High Performance Conjugate Gradient (HPCG) benchmark, we leverage for this purpose efficient implementations of ILUPACK, a preconditioned solver for sparse linear systems that comprises numerical kernels and data access patterns analogous to those of HPCG. Our study analyzes the (computational) performance and energy efficiency, with two metrics for each: time and floating-point throughput for the former, and energy and floating-point throughput per Watt for the latter. This work was supported by the CICYT project TIN2011-23283 of MINECO and FEDER, and by the EU Project FP7 318793 “EXA2GREEN”.
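    Writing t for the execution time, E for the energy consumed over the run and F for the number of floating-point operations, these metrics reduce to the relations below (a generic restatement for clarity, not copied from the paper).

```latex
% t: execution time, E: energy over the run, F: floating-point operations,
% P_avg = E/t: average power.  Generic restatement, not from the paper.
\begin{align}
  \text{throughput} &= \frac{F}{t}, &
  \text{throughput per Watt} &= \frac{F/t}{P_{\mathrm{avg}}} = \frac{F}{E}.
\end{align}
```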

    Balanced and Compressed Coordinate Layout for the Sparse Matrix-Vector Product on GPUs

    We contribute to the optimization of the sparse matrix-vector product on graphics processing units by introducing a variant of the coordinate sparse matrix layout that compresses the integer representation of the matrix indices. In addition, we employ a look-ahead table to avoid the storage of repeated numerical values in the sparse matrix, yielding a more compact data representation that is easier to maintain in the cache. Our evaluation on the two most recent generations of NVIDIA GPUs, the V100 and the A100 architectures, shows considerable performance improvements over the kernels for the sparse matrix-vector product in cuSPARSE (CUDA 11.0.167). This work was partially sponsored by the EU H2020 project 732631 OPRECOMP and project TIN2017-82972-R of the Spanish MINECO. Hartwig Anzt and Yuhsiang M. Tsai were supported by the “Impuls und Vernetzungsfond” of the Helmholtz Association under grant VH-NG-1241 and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors would like to thank the Steinbuch Centre for Computing (SCC) of the Karlsruhe Institute of Technology for providing access to an NVIDIA A100 GPU.
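    As a point of reference for the layout being compressed, a plain sequential coordinate-format (COO) sparse matrix-vector product is sketched below in C. The index compression, the look-ahead table of distinct values and the GPU kernels that constitute the paper's contribution are not reproduced here.

```c
/* Baseline coordinate (COO) sparse matrix-vector product y = A*x on the CPU,
 * shown only as a reference for the layout discussed above. */
typedef struct {
    int     nnz;     /* number of stored non-zeros       */
    int    *row;     /* row index of each non-zero       */
    int    *col;     /* column index of each non-zero    */
    double *val;     /* numerical value of each non-zero */
} coo_matrix;

void coo_spmv(const coo_matrix *A, const double *x, double *y, int nrows)
{
    for (int i = 0; i < nrows; i++)
        y[i] = 0.0;

    /* Each entry contributes val * x[col] to the row it belongs to. */
    for (int k = 0; k < A->nnz; k++)
        y[A->row[k]] += A->val[k] * x[A->col[k]];
}
```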